Interoperability, or making things speak to each other

Or, turning everything into numbers

In this studio, we're going to play around with some functions of interoperability - or, explore different data structures and how to make them connect up.

I find this kind of data science really frustrating (even though I work in data visualization), because it is really unintuitive. It requires a weird kind of imagining, so if you feel like you're losing your footing, don't worry, you're not alone. It's vertiginous!

But, it means that we get to roll together some of the ideas we learnt about in our Tables studio and our Algorithms studio, and prepare for our Images studio in a few weeks' time.

Turning the social into numbers

Let's start by going back and getting some data from our last python studio.

Data from your Reddit API

Sign in to reddit using Google Chrome in a separate tab.

Then go to this page: https://www.reddit.com/prefs/apps

You should already have an app. If you don't, click create app

In the form that opens, enter your app's name, description and redirect uri. For the redirect uri, choose http://localhost:8080

Now, let's import our packages and set up our API connection. You need to fill out your own ID details!

Now, let's scrape our subreddit. In the sub variable you can set your subreddit, and then use query to run a search term.

At the end we'll convert it into a pandas DataFrame called "post_data", which we will use for later gymnastics, and save it to CSV for good measure.
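As a sketch, the conversion step looks something like this - the records below are hypothetical stand-ins for what your praw search loop returns (the real ones come from your own API call):

```python
import pandas as pd

# Hypothetical records standing in for scraped praw submissions --
# in the studio these come from your reddit.subreddit(sub).search(query) loop.
posts = [
    {"title": "First post", "score": 42, "num_comments": 7, "created": 1609459200.0},
    {"title": "Second post", "score": 10, "num_comments": 3, "created": 1609545600.0},
]

post_data = pd.DataFrame(posts)                  # list of dicts -> DataFrame
post_data.to_csv("post_data.csv", index=False)   # save for good measure
print(post_data)
```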

For more info on the parameters you can request for a submission, see: http://lira.no-ip.org:8080/doc/praw-doc/html/code_overview/models/submission.html

Finding numbers in data

In this next section, we're going to get used to different computational types and how they work together.

Let's see what our post_data from Reddit looks like:

Different data types have different properties which allow them to do things, or not do things. For instance, you can't plot a character on a graph.

In Python, these are the main data types (thanks to Shawn Ren for the graph):

So, let's check out the data types of our post_data data set:
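If you want to see this on a toy example, here's a miniature, made-up post_data with the same mix of types Reddit gives us:

```python
import pandas as pd

# A small stand-in for post_data, with the same mix of types
post_data = pd.DataFrame({
    "title": ["First post", "Second post"],   # text -> object
    "score": [42, 10],                        # whole numbers -> int64
    "upvote_ratio": [0.97, 0.88],             # decimals -> float64
    "created": [1609459200.0, 1609545600.0],  # Unix timestamps -> float64
})
print(post_data.dtypes)
```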

We're seeing a lot of pandas objects (which, because this is a DataFrame, we will need to convert before we can use them), but also some integers and floating points, which are numeric forms. This is awesome!

So, let's try plotting some data using matplotlib's pyplot. Most digital images are Cartesian (like maps!), meaning that they work on an x,y axis, where each pixel is assigned an x,y coordinate. This coordinate system, the foundation of analytic (or coordinate) geometry, combines spatial measurement forms with numeric forms.

So, you can set any of the int64 or float64 values against each other:

Okay, cool. But what about the time of the post? Take a look at the "created" column - this is a time stamp in Unix time, a universal time that is free from timezones:

Unix time (a.k.a. POSIX time or Epoch time) is a system for describing instants in time, defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds. It is used widely in Unix-like and many other operating systems and file formats. Due to its handling of leap seconds, it is neither a linear representation of time nor a true representation of UTC.

We're going to need to bring that into something readable to humans! So, let's convert it, and make sure that it's in a datetime format and that it looks about right to the human eye:
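The conversion itself is one line of pandas - here sketched with two hypothetical timestamps:

```python
import pandas as pd

post_data = pd.DataFrame({"created": [1609459200.0, 1609545600.0]})
# unit="s" tells pandas these are Unix seconds
post_data["date"] = pd.to_datetime(post_data["created"], unit="s")
print(post_data["date"])
```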

Now, let's plot the date against the number of comments.

So, we've been using the useful and classic matplotlib to do our graphics. But it's not really the best. Let's try another and see if we can get some more information. Let's use plotly.

I'm not going to bore you with more graphs - but when you're feeling up to it, feel free to take a look at the different kinds of charts you can make and have a play around - you could even combine several reddit datasets!

https://plotly.com/python/

Turning our bodies into numbers

Now, let's turn to something a little more complicated, with some reflections on Wernimont's piece on the Quantified Self and explore some of the ways in which our bodies are made data.

I've located and exported my own (seriously incomplete, and didn't even realise I had authorised it) health data from my iPhone's Health App for a laugh.

When downloaded, this comes in a .zip format. When expanded, you get two files - export.xml is the one that we want.

XML, like GeoJSON, is a good format for holding together different types of data in the same document. But it's not super useful for Python, so we're going to run the apple-health-data-parser created by Nicholas Radcliffe to "parse" (or separate out) the data into different CSV files. Then we can have a little look at it more closely.

Normally, we would run a .py file using the command line (like Terminal), but Jupyter is friendly, and actually lets us run .py files like a command line from inside the notebook! So, making sure that the following are in the same folder (which they will be if you have downloaded this from GitHub) - Interoperability_Studio.ipynb, apple-health-data-parser.py and export.xml - let's try to do some parsing!

Awesome! Looks like Apple has been secretly collecting four kinds of my data: flights of stairs climbed, how often and loudly I use my headphones, my step count and how far I walk. Let's explore some of this data.

We start by installing (if we haven't already) and importing our libraries: numpy (numeric Python, num-py), pandas (our much loved data format), glob, which helps us find file paths on our computers, the pytz time zone calculator, pyplot for making graphs, and datetime, which does as it says.

Okay, let's see what this data is all about.

And check out what kind of data we're working with here:

Lots of objects, again, and some messy time formats too. Let's clean up. We need to start with date-time - the data crosses a few timezones, I think, but I want to bring it into the one I'm in now - America/Los_Angeles.

Now, let's "parse" (or separate) out the different time sections:
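Here's the whole timezone-and-parsing move sketched on a tiny, made-up stand-in for the step data (the real column names in your exported CSV may differ slightly):

```python
import pandas as pd

# a tiny stand-in for the parsed StepCount.csv
steps = pd.DataFrame({
    "startDate": ["2021-01-01 08:00:00 +0000", "2021-01-02 17:30:00 +0000"],
    "value": [900, 1200],
})

# bring everything into one timezone, then split out date, weekday and hour
steps["startDate"] = (pd.to_datetime(steps["startDate"], utc=True)
                        .dt.tz_convert("America/Los_Angeles"))
steps["date"] = steps["startDate"].dt.date
steps["weekday"] = steps["startDate"].dt.day_name()
steps["hour"] = steps["startDate"].dt.hour
print(steps[["date", "weekday", "hour", "value"]])
```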

And check it's lookin' good!

Coolios - as you can see above, EVERYTHING IS NUMBERS. SEPARATE CATEGORISED NUMBERS. What are those categories, you ask?

We can create some groups for each date, to see how many steps there were each day.

Now, let's save it to CSV for good measure, and so we can start visualising!
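A minimal groupby-and-save sketch, with hypothetical step readings:

```python
import pandas as pd

# hypothetical per-reading step counts
steps = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "value": [900, 300, 1200],
})

# one row per date, summing the readings for that day
daily = steps.groupby("date")["value"].sum().reset_index()
daily.to_csv("daily_steps.csv", index=False)
print(daily)
```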

Time to turn numbers back into images.

What about weekdays? Let's regroup our CSV and see what we find.
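Same move, grouping by weekday instead - again with made-up numbers:

```python
import pandas as pd

steps = pd.DataFrame({
    "weekday": ["Friday", "Saturday", "Friday", "Saturday"],
    "value": [900, 1200, 300, 800],
})

# average steps per reading, for each day of the week
by_weekday = steps.groupby("weekday")["value"].mean()
print(by_weekday)
```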

What about hours (bearing in mind time zones)?

Let's combine the numeric representation of my lived mobilities. What about flights?

Let's parse it out again

And group it into dates

And save...

Attaching numbers to numbers

To be totally ridiculous, let's compare how many steps I take per day of the week with how many comments appear on your chosen subreddit.

First, we need to parse out (again) the time/date data. Then, it's just like above, using "groupby", while paying attention to the column headers.
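One way to line the two datasets up (a sketch, with hypothetical per-weekday summaries) is a merge on the weekday column:

```python
import pandas as pd

# hypothetical per-weekday summaries from the two datasets
steps_by_day = pd.DataFrame({
    "weekday": ["Friday", "Saturday"],
    "steps": [6000, 9000],
})
comments_by_day = pd.DataFrame({
    "weekday": ["Friday", "Saturday"],
    "num_comments": [12, 30],
})

# join the two on their shared weekday column
combined = steps_by_day.merge(comments_by_day, on="weekday")
print(combined)
```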

Have a mess around in your own time - compare means to medians, and ask your friends in data science what it's all about, because honestly, it's just a strange kind of magic.

Turning words into numbers

Turning back to our subreddit, and channelling cultural analytics, let's look a little more closely at some text analysis and see what we can do!

Text works as a str or string:

A word is a string of individual letters, a sentence is a string of words!

(Strings are used a lot in the Digital Humanities and Text Processing - I'm a geographer, and still learning about strings, so bear with me!)

Let's start by grabbing a cell with an object from our post_data dataset. With a pandas DataFrame, everything works on a gridded position as well! You can use iloc (integer location, i.e. location by position) to find particular cells. Let's start with row number 3 (remember, Python counts from 0):

Now, if you count down the list, body is number "7", so let's add that to get the cell.

Now, let's convert it from a pandas object to a string, and give it a name, so we can do some analysis:
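Putting the last few steps together on a miniature, made-up post_data (where body happens to be column 1, not 7 as in your Reddit data):

```python
import pandas as pd

# a miniature stand-in for post_data
post_data = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "body": ["first body", "second body", "third body",
             "This is the text of the fourth post."],
})

row = post_data.iloc[3]                # the whole row at position 3
body_text = str(post_data.iloc[3, 1])  # row 3, column 1, as a plain string
print(body_text)
```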

We can count how many characters are in the string:

Or what the nth letter of the string is (in the below example, the 45th)
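For example, with a hypothetical body_text (your notebook example uses the 45th letter; this string is shorter, so we grab the 12th - and remember Python counts from 0):

```python
body_text = "This is the text of the fourth post."

print(len(body_text))   # 36 characters in total
print(body_text[12])    # the character at position 12: "t"
```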

Counting words

If we wanted to be braver, we could even try to count the most common words across all the posts in the "title" column:
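A sketch of one way to do this, using Python's Counter on some made-up titles:

```python
from collections import Counter
import pandas as pd

# hypothetical titles standing in for your post_data["title"] column
post_data = pd.DataFrame({"title": ["the rise of the data",
                                    "the data of sound"]})

# glue all titles into one big list of lowercase words, then count them
words = " ".join(post_data["title"]).lower().split()
print(Counter(words).most_common(3))
```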

So, there are many "to", "the", "of" .... These are called "stopwords". Let's create a new column with all the stopwords deleted so we can count again.

To do this we import an nltk dictionary which has a list of words.

Then we delete the stopwords from the title column and make a new column without the stopwords.
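Here's the mechanics sketched with a tiny hand-written stopword list standing in for nltk's (the real one comes from stopwords.words("english")):

```python
import pandas as pd

# A tiny stand-in for nltk's English stopword list; in the studio it comes from:
#   from nltk.corpus import stopwords; stop = set(stopwords.words("english"))
stop = {"the", "of", "to", "a", "and", "is"}

post_data = pd.DataFrame({"title": ["the rise of the data",
                                    "the data of sound"]})

# keep only the words that aren't stopwords
post_data["title_nostop"] = post_data["title"].apply(
    lambda t: " ".join(w for w in t.lower().split() if w not in stop))
print(post_data["title_nostop"])
```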

And try again...

Well done!

(as a bonus, you could turn this into a data frame if you wanted, and plot it as well! - though it's not a super interesting graph!)

Turning sound into numbers

Okay, let's try some data that we don't necessarily think of as numeric: sound.

Let's import some libraries to help us out with sound.

Now, let's import those libraries and read our file. We'll directly reference the sound_sample.wav that is in your downloaded folder. And let's print the rate and the audio.

The outputs from wavfile.read are the sampling rate of the track, and the audio wave data. The sampling rate represents the number of data points sampled per second in the audio file. In this case, 44100 pieces of information per second make up the audio wave. This is a very common rate. The higher the rate, the better the quality of the audio.
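Since sound_sample.wav isn't bundled with this text, here's a self-contained sketch that synthesises one second of a 440 Hz tone, writes it out, and reads it back with scipy:

```python
import numpy as np
from scipy.io import wavfile

# synthesise one second of a 440 Hz tone as a stand-in for sound_sample.wav
rate = 44100
t = np.linspace(0, 1, rate, endpoint=False)
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
wavfile.write("sound_sample.wav", rate, tone)

rate, audio = wavfile.read("sound_sample.wav")
print(rate)          # 44100 samples per second
print(audio.dtype)   # int16
print(audio.shape)   # (44100,) -- a single array, so mono
```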

Let's take the shape of the audio data, and look at a second of it!

Looking at the shape of the audio data, it has ONE array, so it's a mono channel.

The data is stored as int16. This is the size of the data stored in each datapoint. Common storage formats are 8, 16 and 32 bits. Again, the higher this is, the better the audio quality.

The values in the data represent the amplitude of the wave (or the loudness of the audio). The energy of the audio can be described by the sum of the absolute amplitude.

This will depend on the length of the audio, the sample rate and the volume of the audio. A better metric is power, which is energy per second...
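A sketch of the energy and power calculations on a synthetic two-second tone:

```python
import numpy as np

rate = 44100
# two seconds of a 440 Hz tone as int16 samples
t = np.linspace(0, 2, 2 * rate, endpoint=False)
audio = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

energy = np.sum(np.abs(audio.astype(np.int64)))  # sum of absolute amplitude
seconds = len(audio) / rate
power = energy / seconds                          # energy per second
print(energy, power)
```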

Now, let's plot the amplitude of the track over time...

Another common way to analyse audio is to create a spectrogram. Audio spectrograms are heat maps that show the frequencies of the sound in Hertz (Hz) over time, with the volume of the sound in decibels (dB) as the colour intensity.
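Using scipy.signal.spectrogram on a synthetic 440 Hz tone, we can check that the loudest frequency bin lands near the tone (a sketch of the idea, not the studio's exact plotting code):

```python
import numpy as np
from scipy import signal

rate = 44100
t = np.linspace(0, 1, rate, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)  # a pure 440 Hz tone

# freqs: frequency bins (Hz), times: time bins (s), Sxx: power at each freq/time
freqs, times, Sxx = signal.spectrogram(audio, fs=rate)

# the loudest frequency bin should sit near 440 Hz
peak_freq = freqs[Sxx.mean(axis=1).argmax()]
print(peak_freq)
```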

The result allows us to pick out a certain frequency and examine it

Okay, that's it for sound!

Turning images into grids into numbers

In the final section of this studio, we're going to use a mixture of matplotlib and another library, imageio, to examine how images work as computational data (and how they're all also secretly grids and numbers).

First, let's import imageio (matplotlib is already imported above), and drag in an image:

All digital images look like this (thanks Stanford for the image):

Just like your graphs above, they have an x and y axis.

Each pixel is made up of three values: red (r), green (g) and blue (b):

We will investigate this a little more in our image workshop, but for now, this provides us with two ways of classifying (and so, searching through) the enormous data set that is an image: colour, and position.

First, let's check that your image is in 3 dimensions (or RGB)

Now, let's find the RGB value of a single pixel!
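Sketched on a synthetic image - a pixel is just the [r, g, b] triple sitting at a row/column position:

```python
import numpy as np

# a synthetic 100x100 image standing in for the one you dragged in
image = np.zeros((100, 100, 3), dtype=np.uint8)
image[:, :, 0] = 200  # red everywhere

pixel = image[50, 50]  # the pixel at row 50, column 50
print(pixel)           # its [r, g, b] values
```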

Can we split the layers so each image just shows the red, green and blue values?

What happens if we change the r value of the rows 50 to 150 to the full 255 intensity?

And finally, let's just highlight only pixel values that are higher than 180 in the r channel!
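The last three moves - splitting out a layer, boosting rows 50 to 150 to full intensity, and thresholding on the red channel - sketched together on a synthetic image:

```python
import numpy as np

# a synthetic image: red is dark on the left half, bright on the right
image = np.zeros((200, 200, 3), dtype=np.uint8)
image[:, 100:, 0] = 220

red = image[:, :, 0]           # the red layer on its own
image[50:150, :, 0] = 255      # rows 50 to 150 pushed to full red intensity
mask = image[:, :, 0] > 180    # True wherever red is brighter than 180
print(mask.sum())              # how many pixels pass the threshold
```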

That's it for today! Don't forget to post your graph or image in the #studios slack channel